Abstract

Bayesian optimization with Gaussian processes has become an increasingly
popular tool in the machine learning community. It is efficient and can be
used when very little is known about the objective function, making it
popular in expensive black-box optimization scenarios. It uses Bayesian
methods to sample the objective efficiently using an acquisition function
which incorporates the model's estimate of the objective and the uncertainty
at any given point. However, there are several different parameterized
acquisition functions in the literature, and it is often unclear which one
to use. Instead of using a single acquisition function, we adopt a portfolio
of acquisition functions governed by an online multi-armed bandit strategy.
We propose several portfolio strategies, the best of which we call GP-Hedge,
and show that this method outperforms the best individual acquisition
function. We also provide a theoretical bound on the algorithm's performance.

1 Introduction 

Bayesian optimization is a powerful strategy for finding the extrema of objective 
functions that are expensive to evaluate. It is applicable in situations where one 
does not have a closed-form expression for the objective function, but where one 
can obtain noisy evaluations of this function at sampled values. It is particu- 
larly useful when these evaluations are costly, when one does not have access to 
derivatives, or when the problem at hand is non-convex. Bayesian optimization 
has two key ingredients. First, it uses the entire sample history to compute 
a posterior distribution over the unknown objective function. Second, it uses 
an acquisition function to automatically trade off between exploration and ex- 
ploitation when selecting the points at which to sample next. As such, Bayesian 
optimization techniques are some of the most efficient approaches in terms of 
the number of function evaluations required [26l |20l H2 H] . In recent years, 
the machine learning community has increasingly used Bayesian optimization 
to optimize expensive objective functions. Examples can be found in robot gait
design (53] , online path planning US] , intelligent user interfaces for anima- 
tion OH], algorithm configuration [17], efficient MCMC [28], sensor placement 
[3"T1 |2"7] , and reinforcement learning [5] . 

However, the choice of acquisition function is not trivial. Several different 
methods have been proposed in the literature, none of which work well for 
all classes of functions. Building on recent developments in the field of online 
learning and multi-armed bandits [8], this paper proposes a solution to this 
problem. The solution is based on a hierarchical hedging approach for managing 
an adaptive portfolio of acquisition functions. 

We review Bayesian optimization and popular acquisition functions in Sec- 
tion 2. In Section 3, we propose the use of various hedging strategies for Bayesian 
optimization [2] [9]. In Section 4, we present experimental results using stan- 
dard test functions from the literature of global optimization. The experiments 
show that the proposed hedging approaches outperform any of the individual 
acquisition functions. We also provide detailed comparisons among the hedging 
strategies. Finally, in Section 5 we present a bound on the cumulative regret 
which helps provide some intuition as to the algorithm's performance.

2 Bayesian optimization 

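We are concerned with the task of optimization on a $d$-dimensional space:

$$\max_{\mathbf{x} \in \mathcal{A} \subseteq \mathbb{R}^d} f(\mathbf{x}).$$

We define $\mathbf{x}_t$ as the $t$th sample and $y_t = f(\mathbf{x}_t) + \varepsilon_t$,
with $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$, as
a noisy observation of the objective function at $\mathbf{x}_t$. Other observation models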
are possible [51 [TU1 [TSJ [3D] , but we will focus on real, Gaussian observations for 
ease of presentation. 

The Bayesian optimization procedure is shown in Algorithm 1. As mentioned
earlier, it has two components: the posterior distribution over the objective
and the acquisition function. Let us focus on the posterior distribution
first and come back to the acquisition function in Section 2.2. As we
accumulate observations¹ $\mathcal{D}_{1:t} = \{\mathbf{x}_{1:t}, y_{1:t}\}$, a prior
distribution $P(f)$ is combined with the likelihood function
$P(\mathcal{D}_{1:t} \mid f)$ to produce the posterior distribution:
$P(f \mid \mathcal{D}_{1:t}) \propto P(\mathcal{D}_{1:t} \mid f) P(f)$. The posterior captures the updated beliefs about
the unknown objective function. One may also interpret this step of Bayesian 
optimization as estimating the objective function with a surrogate function (also 
called a response surface). We will place a Gaussian process (GP) prior on $f$.
Other nonparametric priors over functions, such as random forests, have been 
considered [SJ, but the GP strategy is the most popular alternative. 

¹ Here we use subscripts to denote sequences of data, i.e. $y_{1:t} = \{y_1, \ldots, y_t\}$.

Algorithm 1 Bayesian Optimization

1: for $t = 1, 2, \ldots$ do
2:   Find $\mathbf{x}_t$ by optimizing the acquisition function over the GP: $\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x}} u(\mathbf{x} \mid \mathcal{D}_{1:t-1})$.
3:   Sample the objective function: $y_t = f(\mathbf{x}_t) + \varepsilon_t$.
4:   Augment the data $\mathcal{D}_{1:t} = \{\mathcal{D}_{1:t-1}, (\mathbf{x}_t, y_t)\}$.
5: end for
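The loop in Algorithm 1 can be sketched generically as follows. This is a minimal illustration rather than the authors' code: the names `bayes_opt`, `propose`, and `objective` are hypothetical, and `propose` stands in for the whole step of fitting the GP to the data so far and maximizing the acquisition function.

```python
def bayes_opt(objective, propose, n_iter, data=None):
    """Generic Bayesian optimization loop (sketch of Algorithm 1).

    objective: callable x -> y, a (possibly noisy) evaluation of f.
    propose:   callable data -> x, assumed to maximize an acquisition
               function given the data D_{1:t-1} (hypothetical helper).
    Returns the accumulated list of (x, y) pairs.
    """
    data = list(data or [])
    for _ in range(n_iter):
        x = propose(data)      # step 2: x_t = argmax_x u(x | D_{1:t-1})
        y = objective(x)       # step 3: y_t = f(x_t) + noise
        data.append((x, y))    # step 4: D_{1:t} = {D_{1:t-1}, (x_t, y_t)}
    return data


if __name__ == "__main__":
    # Toy stand-ins: propose the next integer, evaluate a quadratic.
    history = bayes_opt(lambda x: -(x - 2) ** 2, lambda d: len(d), n_iter=4)
    print(history)
```

Keeping `propose` as a plain callable makes it easy to swap a single acquisition function for a portfolio strategy without touching the loop.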



2.1 Gaussian processes 

The objective function is distributed according to a GP prior: 

$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}_i, \mathbf{x}_j)\big).$$

For convenience, and without loss of generality, we assume that the prior mean 
is the zero function (but see [351 1211 H] for examples of nonzero means) . This 
leaves us the more interesting question of defining the covariance function. A 
very popular choice is the squared exponential kernel with a vector of automatic 
relevance determination (ARD) hyperparameters $\theta$ [25]:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\tfrac{1}{2} (\mathbf{x}_i - \mathbf{x}_j)^T \operatorname{diag}(\theta)^{-2} (\mathbf{x}_i - \mathbf{x}_j)\right),$$

where $\operatorname{diag}(\theta)$ is a diagonal matrix with entries $\theta$ along the diagonal and zeros
elsewhere. The choice of hyperparameters will be discussed in the experimental
section, but we note that it is not trivial in this domain because of the paucity
of data. For an in-depth analysis of this issue we refer the reader to e.g. [4, 27].

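We can sample the GP at $t$ points by choosing the indices $\{\mathbf{x}_{1:t}\}$ and
sampling the values of the function at these indices to produce the data
$\mathcal{D}_{1:t}$. The function values are distributed according to a multivariate
Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{K})$, with covariance entries
$k(\mathbf{x}_i, \mathbf{x}_j)$. Assume that we already have the observations, say from
previous iterations, and that we want to use Bayesian optimization to decide
what point $\mathbf{x}_{t+1}$ should be considered next. Let us denote the value of the
function at this arbitrary point as $f_{t+1}$. Then, by the properties of GPs,
$\mathbf{f}_{1:t}$ and $f_{t+1}$ are jointly Gaussian:

$$\begin{bmatrix} \mathbf{f}_{1:t} \\ f_{t+1} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{k} \\ \mathbf{k}^T & k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) \end{bmatrix}\right),$$

where $\mathbf{k} = [k(\mathbf{x}_{t+1}, \mathbf{x}_1), k(\mathbf{x}_{t+1}, \mathbf{x}_2), \ldots, k(\mathbf{x}_{t+1}, \mathbf{x}_t)]$.
Using the Sherman-Morrison-Woodbury formula (see [29] for a comprehensive
treatment), one can easily arrive at an expression for the predictive
distribution:

$$P(y_{t+1} \mid \mathcal{D}_{1:t}, \mathbf{x}_{t+1}) = \mathcal{N}\big(\mu_t(\mathbf{x}_{t+1}), \sigma_t^2(\mathbf{x}_{t+1}) + \sigma^2\big),$$

where

$$\mu_t(\mathbf{x}_{t+1}) = \mathbf{k}^T [\mathbf{K} + \sigma^2 \mathbf{I}]^{-1} \mathbf{y}_{1:t},$$

$$\sigma_t^2(\mathbf{x}_{t+1}) = k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) - \mathbf{k}^T [\mathbf{K} + \sigma^2 \mathbf{I}]^{-1} \mathbf{k}.$$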

In this sequential decision making setting, the number of query points is rela- 
tively small and, consequently, the GP predictions are easy to compute. 
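As an illustration of how cheap these predictions are for small $t$, the predictive mean and variance above can be computed with a Cholesky factorization in a few lines of dependency-free Python. This is a sketch under stated assumptions: the function names are hypothetical, the prior mean is zero, the kernel is the squared exponential with ARD length scales `theta`, and `noise_var` plays the role of $\sigma^2$ (it should be positive, or the inputs distinct, so that the matrix is positive definite).

```python
import math


def sq_exp_kernel(xi, xj, theta):
    """Squared exponential kernel with ARD length scales theta."""
    s = sum(((a - b) / th) ** 2 for a, b, th in zip(xi, xj, theta))
    return math.exp(-0.5 * s)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) z = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def gp_predict(X, y, x_new, theta, noise_var):
    """Return (mu_t(x_new), sigma_t^2(x_new)) for a zero-mean GP."""
    t = len(X)
    # K + sigma^2 I
    K = [[sq_exp_kernel(X[i], X[j], theta) + (noise_var if i == j else 0.0)
          for j in range(t)] for i in range(t)]
    k = [sq_exp_kernel(x_new, X[i], theta) for i in range(t)]
    L = cholesky(K)
    alpha = chol_solve(L, y)   # [K + sigma^2 I]^{-1} y_{1:t}
    v = chol_solve(L, k)       # [K + sigma^2 I]^{-1} k
    mu = sum(ki * ai for ki, ai in zip(k, alpha))
    var = sq_exp_kernel(x_new, x_new, theta) - sum(ki * vi for ki, vi in zip(k, v))
    return mu, max(var, 0.0)   # clip tiny negative values from round-off


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0]]
    y = [0.0, 0.8, 0.9]
    print(gp_predict(X, y, [1.5], theta=[1.0], noise_var=1e-3))
```

The factorization costs $O(t^3)$, which is negligible next to an expensive objective evaluation when $t$ is in the tens or hundreds.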
Figure 1: Acquisition functions with different values of the exploration parameters $\nu$
and $\xi$. The GP posterior is shown at the top. The other images show the acquisition
functions for that GP. From the top: probability of improvement, expected
improvement and upper confidence bound. The maximum of each acquisition function, where
the GP is to be sampled next, is shown with a triangle marker. Note the increased
preference for exploration exhibited by GP-UCB.

2.2 Acquisition functions

The role of the acquisition function is to guide the search for the optimum. 
Typically, acquisition functions are defined such that high values correspond to 
potentially high values of the objective function, whether because the prediction 
is high, the uncertainty is great, or both. The acquisition function is maximized 
to select the next point at which to evaluate the objective function. That is, 
we wish to sample the objective function at $\operatorname{argmax}_{\mathbf{x}} u(\mathbf{x} \mid \mathcal{D})$. This auxiliary
maximization problem, where the objective is known and easy to evaluate, can 
be easily carried out with standard numerical techniques such as multistart or 
DIRECT [Till [T3]. The acquisition function is sometimes called the infill or 
simply the "utility" function. In the following sections, we will look at the three 
most popular choices. Figure 1 shows how these give rise to distinct sampling
behaviour. 

Probability of improvement (PI): The early work of Kushner [5T] suggested
maximizing the probability of improvement over the incumbent
$\mu^+ = \max_t \mu(\mathbf{x}_t)$. The drawback, intuitively, is that this formulation is
biased toward exploitation only. To remedy this, practitioners often add a
trade-off parameter $\xi \ge 0$, so that

$$\mathrm{PI}(\mathbf{x}) = P\big(f(\mathbf{x}) \ge \mu^+ + \xi\big) = \Phi\left(\frac{\mu(\mathbf{x}) - \mu^+ - \xi}{\sigma(\mathbf{x})}\right),$$

where $\Phi(\cdot)$ is the standard Normal cumulative distribution function (CDF). The
exact choice of $\xi$ is left to the user. Kushner recommends using an (unspecified)
schedule for $\xi$, which should start high in order to drive exploration and decrease
towards zero as the algorithm progresses. Lizotte, however, found that using
such a schedule did not offer improvement over a constant value of $\xi$ on a suite
of test functions [53].

Expected improvement (EI): More recent work has tended to take into
account not only the probability of improvement, but the magnitude of the
improvement a point can potentially yield. Mockus et al. [26] proposed
maximizing the expected improvement with respect to the best function value yet
seen, given by the incumbent $\mathbf{x}^+ = \operatorname{argmax}_{\mathbf{x}_t} f(\mathbf{x}_t)$. For our Gaussian
process posterior, one can easily evaluate this expectation (see [18]), yielding:

$$\mathrm{EI}(\mathbf{x}) = \begin{cases} d\,\Phi\big(d / \sigma(\mathbf{x})\big) + \sigma(\mathbf{x})\,\phi\big(d / \sigma(\mathbf{x})\big) & \text{if } \sigma(\mathbf{x}) > 0, \\ 0 & \text{if } \sigma(\mathbf{x}) = 0, \end{cases}$$

where $d = \mu(\mathbf{x}) - \mu^+ - \xi$ and where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the PDF and CDF of
the standard Normal distribution respectively. Here $\xi$ is an optional trade-off
parameter analogous to the one defined above.
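Both PI and EI depend on the GP only through the predictive mean and standard deviation at a point, so they are straightforward to evaluate. The following is a minimal sketch using only the standard library; the function names and the handling of the degenerate case $\sigma(\mathbf{x}) = 0$ for PI are assumptions made for this illustration.

```python
import math


def norm_pdf(z):
    """Standard Normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard Normal CDF Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, mu_plus, xi=0.01):
    """PI = Phi((mu - mu_plus - xi) / sigma)."""
    d = mu - mu_plus - xi
    if sigma <= 0.0:
        # Assumed convention: with no uncertainty the event is deterministic.
        return 1.0 if d >= 0.0 else 0.0
    return norm_cdf(d / sigma)


def expected_improvement(mu, sigma, mu_plus, xi=0.01):
    """EI = d * Phi(d / sigma) + sigma * phi(d / sigma), and 0 when sigma = 0."""
    if sigma <= 0.0:
        return 0.0
    d = mu - mu_plus - xi
    z = d / sigma
    return d * norm_cdf(z) + sigma * norm_pdf(z)


if __name__ == "__main__":
    # Two candidate points with the same mean but different uncertainty.
    for sigma in (0.1, 1.0):
        print(sigma,
              probability_of_improvement(0.5, sigma, mu_plus=0.6),
              expected_improvement(0.5, sigma, mu_plus=0.6))
```

Evaluating both at a point whose mean is below the incumbent shows the intended behaviour: larger predictive uncertainty raises both scores, and EI additionally scales with how large the improvement could be.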

Upper confidence bound (UCB & GP-UCB): Cox and John [11] introduce an
algorithm they call "Sequential Design for Optimization", or SDO.
Given a random function model, SDO selects points for evaluation based on a
confidence bound consisting of the mean and weighted variance:
$\mu(\mathbf{x}) + \kappa \sigma(\mathbf{x})$. As with the other acquisition models, however, the
parameter $\kappa$ is left to the user. A principled approach to selecting this
parameter is proposed by Srinivas et al. [31]. In this work, the authors define
the instantaneous regret of the selection algorithm as
$r(\mathbf{x}) = f(\mathbf{x}^\star) - f(\mathbf{x})$ and attempt to select a sequence of weights $\kappa_t$
so as to minimize the cumulative regret $R_T = r(\mathbf{x}_1) + \cdots + r(\mathbf{x}_T)$.
Using the upper confidence bound selection criterion with $\kappa_t = \sqrt{\nu \beta_t}$ and the
hyperparameter $\nu > 0$, Srinivas et al. define

$$\text{GP-UCB}(\mathbf{x}) = \mu(\mathbf{x}) + \sqrt{\nu \beta_t}\, \sigma(\mathbf{x}).$$

It can be shown that this method has cumulative regret bounded by
$O(\sqrt{T \beta_T \gamma_T})$ with high probability. Here $\beta_T$ is a carefully selected
learning rate and $\gamma_T$ is a bound on the information gained by the algorithm at
selected points after $T$ steps. Both of these terms depend upon the particular
form of kernel function used, but for most kernels their product can be shown
to be sublinear in $T$. We refer the interested reader to the original paper [31]
for exact bounds.

The sublinear bound on cumulative regret implies that the method is
no-regret, i.e. that $\lim_{T \to \infty} R_T / T = 0$. This in turn provides a bound on the
convergence rate for the optimization process, since the regret at the maximum
$f(\mathbf{x}^\star) - \max_t f(\mathbf{x}_t)$ is upper bounded by the average regret
$R_T / T = f(\mathbf{x}^\star) - \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_t)$. As we will note later, however, this
bound can be quite loose in practice.
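The step from average regret to the regret at the maximum uses only the fact that the largest sampled value is at least as large as the mean of the sampled values. Written out in the notation of this section:

```latex
\begin{align*}
\max_{t \le T} f(\mathbf{x}_t) \;\ge\; \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_t)
\quad\Longrightarrow\quad
f(\mathbf{x}^\star) - \max_{t \le T} f(\mathbf{x}_t)
\;&\le\; f(\mathbf{x}^\star) - \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_t) \\
&= \frac{1}{T} \sum_{t=1}^{T} \big( f(\mathbf{x}^\star) - f(\mathbf{x}_t) \big)
 = \frac{1}{T} \sum_{t=1}^{T} r(\mathbf{x}_t)
 = \frac{R_T}{T}.
\end{align*}
```

The inequality is tight only when every sample attains the same value, which is one way to see why the bound tends to be loose for an optimizer that deliberately explores.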



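Algorithm 2 GP-Hedge

1: Select parameter $\eta \in \mathbb{R}^+$.
2: Set $g_0^i = 0$ for $i = 1, \ldots, N$.
3: for $t = 1, 2, \ldots$ do
4:   Nominate points from each acquisition function: $\mathbf{x}_t^i = \operatorname{argmax}_{\mathbf{x}} u_i(\mathbf{x} \mid \mathcal{D}_{1:t-1})$.
5:   Select nominee $\mathbf{x}_t = \mathbf{x}_t^j$ with probability $p_t(j) = \exp(\eta g_{t-1}^j) / \sum_{\ell=1}^{N} \exp(\eta g_{t-1}^\ell)$.
6:   Sample the objective function $y_t = f(\mathbf{x}_t) + \varepsilon_t$.
7:   Augment the data $\mathcal{D}_{1:t} = \{\mathcal{D}_{1:t-1}, (\mathbf{x}_t, y_t)\}$.
8:   Receive rewards $r_t^i = \mu_t(\mathbf{x}_t^i)$ from the updated GP.
9:   Update gains $g_t^i = g_{t-1}^i + r_t^i$.
10: end for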



3 Portfolio strategies 

There is no choice of acquisition function that can be guaranteed to perform
best on an arbitrary, unknown objective. In fact, it may be the case that no
single acquisition function will perform the best over an entire optimization:
a mixed strategy in which the acquisition function is sampled from a pool (or
portfolio) at each iteration might work better than any single acquisition function. This
can be treated as a hierarchical multi-armed bandit problem, in which each of 
the N arms is an acquisition function, each of which is itself an infinite-armed 
bandit problem. In this section we propose solving the selection problem using 
three strategies from the literature, the application of which we believe to be 
novel. 



Hedge is an algorithm which at each time step $t$ selects an action $i$ with
probability $p_t(i)$ based on the cumulative rewards (gain) for that action (see
Auer et al. [2]). After selecting an action the algorithm receives reward $r_t^i$
for each action and updates the gain vector. In the Bayesian optimization
setting, we can define $N$ bandits each corresponding to a single acquisition
function. Choosing action $i$ corresponds to sampling from the point nominated
by function $u_i$, i.e. $\mathbf{x}_t^i = \operatorname{argmax}_{\mathbf{x}} u_i(\mathbf{x} \mid \mathcal{D}_{1:t-1})$ for $i = 1, \ldots, N$. Finally, while in
the conventional Bayesian optimization setting the objective function is sampled
only once per iteration, Hedge is a full information strategy and requires a
reward for every action at every time step. We can achieve this by defining the
reward at $\mathbf{x}_t^i$ as the expected value of the GP model at $\mathbf{x}_t^i$. That is, $r_t^i = \mu_t(\mathbf{x}_t^i)$.
We refer to this method as GP-Hedge. Provided that the objective function is
smooth, this reward definition is reasonable.
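The bookkeeping that Hedge contributes to GP-Hedge is small: a softmax over the gains to choose among the nominees, and an additive update of every gain. A minimal sketch follows; the function names are hypothetical, and the GP fitting and acquisition maximization that produce the nominees and the rewards $\mu_t(\mathbf{x}_t^i)$ are assumed to happen elsewhere.

```python
import math
import random


def hedge_probabilities(gains, eta):
    """p_t(j) proportional to exp(eta * g_{t-1}^j).

    Subtracting the largest gain before exponentiating leaves the
    probabilities unchanged and avoids overflow for large gains.
    """
    m = max(gains)
    weights = [math.exp(eta * (g - m)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


def hedge_select(gains, eta, rng=random):
    """Draw the index j of the acquisition function whose nominee is sampled."""
    probs = hedge_probabilities(gains, eta)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]


def hedge_update(gains, rewards):
    """Full-information update: every arm i receives its own reward r_t^i."""
    return [g + r for g, r in zip(gains, rewards)]


if __name__ == "__main__":
    rng = random.Random(0)
    gains = [0.0, 0.0, 0.0]                 # g_0^i = 0 for N = 3 arms
    for rewards in ([0.2, 0.5, 0.1], [0.3, 0.6, 0.1]):
        j = hedge_select(gains, eta=1.0, rng=rng)
        gains = hedge_update(gains, rewards)
        print(j, hedge_probabilities(gains, eta=1.0))
```

Note that the update uses the rewards of all arms, not only the selected one; this is what distinguishes the full-information setting from the partial-information setting discussed next.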

Auer et al. also propose the Exp3 algorithm, a variant of Hedge that applies
to the partial information setting. In this setting it is no longer assumed that
rewards are observed for all actions. Instead, at each iteration a reward is only
associated with the particular action selected. The algorithm uses Hedge as a
subroutine where rewards observed by Hedge at each iteration are $r_t^i / p_t(i)$ for
the action selected and zero for all other actions. Here $p_t(i)$ is the probability that
Hedge would have selected action $i$. The Exp3 algorithm, meanwhile, selects
actions from a distribution that is a mixture between $p_t(i)$ and the uniform
distribution. Intuitively this ensures that the algorithm does not miss good
actions because the initial rewards were low (i.e. it continues exploring).
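One step of this partial-information scheme can be sketched as below. Several details are assumptions of the sketch rather than statements from the text: the function name `exp3_step`, the mixing weight `gamma`, and in particular the choice to divide the observed reward by the probability with which the arm was actually drawn (the mixture probability), which keeps the reward estimate unbiased and follows the usual presentation of Exp3.

```python
import math
import random


def exp3_step(gains, eta, gamma, reward_fn, rng=random):
    """One Exp3-style step on top of Hedge (illustrative sketch).

    gains:     current gain estimates, one per arm.
    eta:       Hedge learning rate.
    gamma:     weight on the uniform distribution, in (0, 1].
    reward_fn: callable i -> reward, queried only for the selected arm.
    Returns (selected arm, updated gains, sampling distribution).
    """
    n = len(gains)
    m = max(gains)
    w = [math.exp(eta * (g - m)) for g in gains]
    total = sum(w)
    hedge_p = [wi / total for wi in w]
    # Mixture of the Hedge distribution and the uniform distribution.
    mixed_p = [(1.0 - gamma) * p + gamma / n for p in hedge_p]
    i = rng.choices(range(n), weights=mixed_p, k=1)[0]
    r = reward_fn(i)
    new_gains = list(gains)
    # Importance-weighted reward for the selected arm, zero for the others.
    new_gains[i] += r / mixed_p[i]
    return i, new_gains, mixed_p


if __name__ == "__main__":
    rng = random.Random(1)
    gains = [0.0, 0.0, 0.0]
    for _ in range(5):
        arm, gains, probs = exp3_step(gains, eta=0.1, gamma=0.2,
                                      reward_fn=lambda i: (0.2, 0.5, 0.1)[i],
                                      rng=rng)
        print(arm, [round(p, 3) for p in probs])
```

Because `gamma / n` is a floor on every arm's sampling probability, the importance weights stay bounded and no arm is ever starved of samples.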

Finally, another possible strategy is the NormalHedge algorithm [9]. This 
method, however, is built to take advantage of situations where the number of 
bandit arms (acquisition functions) is large, and may not be a good match to 
problems where N is relatively small. 

The GP-Hedge procedure is shown in Algorithm 2. In practice any of these
hedging strategies could be used; however, in our experiments we find that Hedge
tends to outperform the others. Note that it is necessary to optimize $N$
acquisition functions at each time step rather than 1. While this might seem expensive,
this is unlikely to be a major problem in practice for small $N$, as (i) Bayesian
optimization is typically employed when sampling the objective is so expensive
as to dominate other costs; (ii) it has been shown that fast approximate
optimization of $u$ is usually sufficient [6, 22, 17]; and (iii) it is straightforward to
run the optimizations in parallel on a modern multicore processor.

We will also note that the setting of our problem is somewhere "in between" 
the full and partial information settings. Consider, for example, the situation 
that all points sampled by our algorithm are "too distant" in the sense that 
the kernels evaluated at these points exert negligible influence on each other. 
In this case, we can see that only information obtained by the sampled point is 
available, and as a result GP-Hedge will be over-confident in its predictions when 
using the full-information strategy. However, this behaviour is not observed in 
practical situations because of smoothness properties, as well as our particular 
selection of acquisition functions. In the case of adversarial acquisition functions 
one might instead choose to use the Exp3 variant. 
Figure 2: (Best viewed in colour.) Comparison of different acquisition approaches on
three commonly used literature functions. The top plots show the mean and variance of
the gap metric averaged over 25 trials. We note that the top two performing algorithms
use a portfolio strategy. With N = 3 acquisition functions, GP-Hedge beats the
best-performing acquisition function in almost all cases. With N = 9, we add additional
instances of the three acquisition functions, but with different parameters. Despite
the fact that these additional functions individually perform worse than the ones with
default parameters, adding them to GP-Hedge improves performance in the long run.
The bottom plots show an example evolution of GP-Hedge's portfolio with N = 9 for
each objective function. The height of each band corresponds to the probability $p_t(i)$ at
each iteration.



4 Experiments 

To validate the use of GP-Hedge, we tested the optimization performance on
a set of test functions with known maxima $f(\mathbf{x}^\star)$.² To see how effective each
method is at finding the global maximum, we use the "gap" metric [16], defined
as

$$G_t = \frac{f(\mathbf{x}^+) - f(\mathbf{x}_1)}{f(\mathbf{x}^\star) - f(\mathbf{x}_1)},$$

where again $\mathbf{x}^+$ is the incumbent or best function sample found up to time $t$. The
gap $G_t$ will therefore be a number between 0 (indicating no improvement over
the initial sample) and 1 (if the incumbent is the maximum). Note that, while this
performance metric is evaluated on the true function values, this information is
not available to the optimization methods.

² Code for the optimization methods and experiments will be made available online.
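The gap metric is simple to compute once the true (noise-free) function values of the samples and the known maximum are available. The sketch below is illustrative only; the function names are hypothetical, and raising an error when the first sample is already optimal is a choice made here because the ratio is then undefined.

```python
def gap_metric(f_first, f_incumbent, f_opt):
    """G_t = (f(x+) - f(x_1)) / (f(x*) - f(x_1))."""
    denom = f_opt - f_first
    if denom == 0.0:
        raise ValueError("initial sample is already optimal; gap is undefined")
    return (f_incumbent - f_first) / denom


def gap_trace(f_values, f_opt):
    """Gap after each sample, given true values f(x_1), f(x_2), ... in order."""
    first = f_values[0]
    best = first
    trace = []
    for v in f_values:
        best = max(best, v)          # incumbent: best value seen so far
        trace.append(gap_metric(first, best, f_opt))
    return trace


if __name__ == "__main__":
    # True function values of five successive samples; known maximum is 1.0.
    print(gap_trace([0.0, 0.5, 0.25, 0.75, 1.0], f_opt=1.0))
```

Because the incumbent never gets worse, the trace is non-decreasing, which makes it convenient to average over repeated trials.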
Figure 3: (Best viewed in colour.) Comparison of different hedging strategies on three
commonly used literature functions. The top plots show the mean and variance of the
gap metric averaged over 25 trials. Note that both Hedge and Exp3 outperform the
best single acquisition function, GP-UCB. The bottom plots show the mean average
regret for each method (lower is better). Average regret is shown in order to compare
with previous work [31]; however, as noted in the text, the gap measure provides a more
direct comparison of optimization performance. We see that mixed strategies (i.e.
GP-Hedge) perform comparably to GP-UCB under the regret measure and outperform this
individual strategy under the gap measure. As the problems get harder, and with higher
dimensionality, GP-Hedge significantly outperforms other acquisition strategies.

4.1 Standard test functions 

We first tested performance using functions common to the literature on Bayesian 
optimization: the Branin, Hartman 3, and Hartman 6 functions. All of these are 
continuous, bounded, and multimodal, with 2, 3, and 6 dimensions respectively. 
We omit the formulae of the functions for space reasons, but they can be found 
in [22]. 

For each experiment, we optimized 25 times and computed the mean and
variance of the gap metric over time. In these experiments we used
hyperparameters $\theta$ chosen offline so as to maximize the log marginal likelihood of a
(sufficiently large) set of sample points; see [2H]. We compared the standard
acquisition functions using parameters suggested by previous authors, i.e. $\xi = 0.01$
for EI and PI, $\delta = 0.1$ and $\nu = 0.2$ for GP-UCB [21 EH]. For the GP-Hedge
trials, we tested performance using both 3 acquisition functions and 9
acquisition functions. For the 3-function variant we use the standard acquisition
functions with default hyperparameters. The 9-function variant uses these same
three as well as 6 additional acquisition functions consisting of: both PI and EI
with $\xi = 0.1$ and $\xi = 1.0$, GP-UCB with $\nu = 0.1$ and $\nu = 1.0$. While we omit
trials of these additional acquisition functions for space reasons, these values are
not expected to perform as well as the defaults and our experiments confirmed
this hypothesis. However, we are curious to see if adding known suboptimal
acquisition functions will help or hinder GP-Hedge in practice.

Results for the gap measure $G_t$ are shown in Figure 2. While the
improvement GP-Hedge offers over the best single acquisition function varies, there is
almost no combination of function and time step in which the 9-function
GP-Hedge variant is not the best-performing method. The results suggest that the
extra acquisition functions assist GP-Hedge in exploring the space in the early
stages of the optimization process. Figure 2 also displays, for a single example
run, how the arm probabilities $p_t(i)$ used by GP-Hedge evolve over time.
We have observed that the distribution becomes more stable when the acqui- 
sition functions come to a general consensus about the best region to sample. 
As the optimization progresses, exploitation becomes more rewarding than ex- 
ploration, resulting in more probability being assigned to methods that tend to 
exploit. However, note that if the initial portfolio had consisted only of these 
more exploitative acquisition functions, the likelihood of becoming trapped at 
suboptimal points would have been higher. 

In Figure 3 we compare against the other hedging strategies introduced
in Section 3 under both the gap measure and mean average regret. We also
introduce a baseline strategy which utilizes a portfolio uniformly distributed over
the same acquisition functions. The results show that mixing across multiple
acquisition functions provides significant performance benefits under the gap
measure, and as the problems' difficulty/dimensionality increases we see that
GP-Hedge outperforms other mixed strategies. The uniform strategy performs
well on the easier test functions, as the individual acquisition functions are 
reasonable. However, for the hardest problem (Hartman 6) we see that the 
performance of the naive uniform strategy degrades. NormalHedge performs 
particularly poorly on this problem. We observed that this algorithm very 
quickly collapses to an exclusively exploitative portfolio which becomes very 
conservative in its departures from the incumbent. We again note that this 
strategy is intended for large values of N, which may explain this behaviour. 

In the case of the regret measure we see that the hedging strategies perform
comparably to GP-UCB, a method designed to optimize this measure. We
further note that although the average regret can be seen as a lower-bound 
on the convergence of Bayesian optimization methods, this bound can be loose 
in practice. Further, in the setting of Bayesian optimization we are typically 
concerned not with the cumulative regret during optimization, but instead with 
the regret incurred by the incumbent after optimization is complete. Similar 
notions of "simple regret" have been studied in [1, 7].

Based on the performance in these experiments, we use Hedge as the under- 
lying algorithm for GP-Hedge in the remainder of the experiments. 

4.2 Sampled test functions 

As there is no generally-agreed-upon set of test functions for Bayesian optimiza- 
tion in higher dimensions, we seek to sample synthetic functions from a known 
Figure 4: (Best viewed in colour.) We compare the performance of the acquisition
approaches on synthetic functions sampled from a GP prior with randomly initialized 
hyperparameters. Shown are the mean and variance of the gap metric over 25 sampled 
functions. Here, the variance is a relative measure of how well the various algorithms 
perform while the functions themselves are varied. While the variance is high (which 
is to be expected over diverse functions) , we can see that GP-Hedge is at least com- 
parable to the best acquisition functions and ultimately superior for both N = 3 and 
N = 9. We also note that for the 10D and 20D experiments GP-UCB performs quite 
well but suffers in the 40D experiment. This helps to confirm our hypothesis that a 
mixed strategy is particularly useful in situations where we do not possess strong prior 
information with regards to the choice of acquisition function. 



GP prior similar to [22]. For further details on how these functions are sampled
see Appendix B. As can be seen in Figure 4, GP-Hedge with N = 9 is again the
best-performing method, which becomes even more clear as the dimensionality 
increases. Interestingly, the worst-performing function changes as dimension- 
ality increases. In the 40D experiments, GP-UCB, which generally performed 
well in other experiments, does quite poorly. Examining the behaviour, it ap- 
pears that by trying to minimize regret instead of maximizing improvement, 
GP-UCB favours regions of high variance. However, since a 40D space remains 
extremely sparsely populated even with hundreds of samples, the vast majority 
of the space still has high variance, and thus high acquisition value. 

4.3 Control of a particle simulation 

We also applied these methods to optimize the behavior of a simulated physical 
system in which the trajectories of falling particles are controlled via a set of 
repelling forces. This is a difficult, nonlinear control task whose resulting objec- 
tive function exhibits fairly isolated regions of high value surrounded by severe 
plateaus. Briefly, the four-dimensional state-space in this problem consists of a 
particle's 2D position and velocity $(\mathbf{p}, \dot{\mathbf{p}})$ with two-dimensional actions consisting
of forces which act on the particle. Particles are also affected by gravity and a 
frictional force resisting movement. The goal is to direct the path of the particle 
through regions of the state space with high reward r(p) so as to maximize the 
total reward accumulated over many time-steps. In our experiments we use a 
finite, but large, time-horizon H. In order to control this system we employ 
a set of "repellers" each of which is located at some position $\mathbf{c}_i = (a_i, b_i)$ and
has strength $w_i$ (see the left-most plot of Figure 5). The force on a particle at
position $\mathbf{p}$ is a weighted sum of the individual forces from all repellers, each of
which is inversely proportional to the distance $\|\mathbf{p} - \mathbf{c}_i\|$. For further details we
refer the reader to [15].

This problem can be formulated in the setting of Bayesian optimization by
defining the vector of repeller parameters $\mathbf{x} = (w_1, a_1, b_1, \ldots)$. In the
experiments shown in Figure 5 we will utilize three repellers, resulting in a 9D
optimization task. We can then define our objective as the total $H$-step expected
reward $f(\mathbf{x}) = \mathbb{E}\left[\sum_{n=0}^{H} r(\mathbf{p}_n) \mid \mathbf{x}\right]$. Finally, since the model defines a probability
distribution $P_{\mathbf{x}}(\mathbf{p}_{0:H})$ over particle trajectories we can obtain a noisy estimate of
this objective function by sampling a single trajectory and evaluating the sum
over its immediate rewards.

Results for this optimization task are shown in Figure 5. As with the previous
synthetic examples, GP-Hedge outperforms each of its constituent methods. We
further note the particularly poor performance of PI on this example, which in 
part we hypothesize is a result of plateaus in the resulting objective function. 
In particular PI has trouble exploring after it has "locked on" to a particular 
mode, a fact that seems exacerbated when there are large regions with very 
little change in objective. 



5 Convergence behaviour 

Properly assessing the convergence behaviour of hedging algorithms of this type 
is very problematic. The main difficulty lies with the fact that decisions made at 
iteration t affect the state of the problem and the resulting rewards at all future 
iterations. As a result we cannot relate the regret of our algorithm directly to 
the regret of the best underlying acquisition strategy: had we actually used the 
best underlying strategy we would have selected completely different points [8,
Section 7.11].

Regret bounds for the underlying GP-UCB algorithm have been shown [31].
Starting with Auer et al. we also have regret bounds for the hedging strategies 
used to select between acquisition functions [2] (improved bounds can also be 
found in [8]). However, because of the points stated in the previous paragraph, 
and expounded in more detail in the appendix, we cannot simply combine both 
regret bounds. 

With these caveats in mind we will consider a slightly different algorithmic
framework. In particular we will consider rewards at iteration $t$ given by the
mean $\mu_{t-1}(\mathbf{x}_t)$, where this assumption is made merely to simplify the following
proof. We will also assume that GP-UCB is included as one of the possible
acquisition functions due to its associated convergence results (see Section 2.2).
In this scenario we can obtain the following bound on our cumulative regret.

Theorem 1. Assume GP-Hedge is used with a collection of acquisition
strategies, one of which is GP-UCB with parameters $\beta_t$. If we also have a bound $\gamma_T$
on the information gained at points selected by the algorithm after $T$ iterations,
then with probability at least $1 - \delta$ the cumulative regret is bounded by

$$R_T \le \sqrt{T C_1 \beta_T \gamma_T} + \left[\sum_{t=1}^{T} \beta_t\, \sigma_{t-1}(\mathbf{x}_t^{\mathrm{UCB}})\right] + O(\sqrt{T}),$$

where $\mathbf{x}_t^{\mathrm{UCB}}$ is the $t$th point proposed by GP-UCB.

We give a full proof of this theorem in the appendix. We will note that this
theorem on its own does not guarantee the convergence of the algorithm, i.e.
that $\lim_{T \to \infty} R_T / T = 0$. We can see, however, that our regret is bounded by
two sub-linear terms and an additional term which depends on the information 
gained at points proposed, but not necessarily selected. In some sense this 
additional term depends on the proximity of points proposed by GP-UCB to 
points previously selected, the expected distance of which should decrease as 
the number of iterations increases. 



6 Conclusions and future work 

Hedging strategies are a powerful tool in the design of acquisition functions 
for Bayesian optimization. In this paper we have shown that strategies that 
adaptively modify a portfolio of acquisition functions often perform substan- 
tially better — and almost never worse — than the best-performing individual 
acquisition function. Our experiments have also shown that full-information 
strategies are able to outperform partial-information strategies in many situa- 
tions. However, partial-information strategies can be beneficial in instances of 
high N or in situations where the acquisition functions provide very conflicting
advice. Evaluating these tradeoffs is an interesting area of future research. 

Finally, while the EI and PI acquisition functions can perform well in prac- 
tice, there currently exist no regret bounds for these approaches. In this work 
we give a regret bound for our hedging strategy by relating its performance 
to existing bounds for GP-UCB. Although our bound does not guarantee con- 
vergence it does provide some intuition as to the success of hedging methods 
in practice. Another interesting line of future research involves finding similar 
bounds for the gap measure. 

A Proof of Theorem 1



We will consider a portfolio-based strategy using rewards $r_t = \mu_{t-1}(x_t)$ and
selecting between acquisition functions using the Hedge algorithm. In order to
discuss this we will need to write the gain over $T$ steps, in hindsight, that would
have been obtained had we used strategy $i$,

$$g_T^i = \sum_{t=1}^{T} r_t^i = \sum_{t=1}^{T} \mu_{t-1}(x_t^i).$$

We must emphasize, however, that this gain is conditioned on the actual decisions
made by Hedge, namely that points $\{x_1, \ldots, x_{t-1}\}$ were selected by Hedge. If
we define the maximum strategy $g_T^{\max} = \max_i g_T^i$ we can then bound the regret
of Hedge with respect to this gain.

Lemma 1. With probability at least $1 - \delta_1$ and for a suitable choice of Hedge
parameters, $\eta = \sqrt{8 \ln k / T}$, the regret is bounded by

$$g_T^{\max} - g_T^{\mathrm{Hedge}} \le O(\sqrt{T}).$$

This result is given without proof as it follows directly from [5, Section 4.2]
for rewards in the range [0, 1]. At the cost of slightly worsening the bound in 
terms of its multiplicative/ additive constants, the following generalizations can 
also be noted: 

• For rewards instead in the arbitrary range³ $[a, b]$ the same bound can be
shown by referring to [5, Section 2.6].

• The choice of $\eta$ in the above Lemma requires knowledge of the time horizon
$T$. By referring to [5, Section 2.3] we can remove this restriction using a
time-varying term $\eta_t = \sqrt{8 \ln k / t}$.

• By referring to [5, Section 6.8] we can also extend this bound to the
partial-information strategy Exp3.

Finally, we should also note that this regret bound trivially holds for any strategy
$i$, since $g_T^{\max}$ is the maximum. It is also important to note that this lemma holds
for any choice of $r_t$, with rewards depending on the actual actions taken by Hedge.
The particular choice of rewards we use for this proof has been selected in order
to enable the derivations that follow.
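
As a rough illustration of the full-information setting in Lemma 1, the sketch below computes Hedge selection probabilities from cumulative gains, using the fixed rate $\eta = \sqrt{8 \ln k / T}$ and the horizon-free variant $\eta_t = \sqrt{8 \ln k / t}$. The function names and the example gains are hypothetical, and the exponential-weights form $p(i) \propto \exp(\eta g^i)$ is the standard Hedge rule rather than something restated in this appendix.

```python
import math

def hedge_probabilities(gains, eta):
    """Selection probabilities p(i) proportional to exp(eta * gain_i)."""
    # Subtract the maximum gain before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    g_max = max(gains)
    weights = [math.exp(eta * (g - g_max)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def eta_fixed(k, T):
    """Learning rate from Lemma 1; requires the horizon T in advance."""
    return math.sqrt(8.0 * math.log(k) / T)

def eta_anytime(k, t):
    """Time-varying learning rate that does not require the horizon."""
    return math.sqrt(8.0 * math.log(k) / t)

# Hypothetical cumulative gains for k = 3 acquisition functions.
gains = [4.2, 5.0, 3.1]
probs = hedge_probabilities(gains, eta_fixed(k=3, T=100))
```

The strategy with the largest gain so far receives the largest probability, while the others keep nonzero mass, so every strategy continues to be explored.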

For the next two lemmas we will refer the reader to [31, Lemmas 5.1 and 5.3]
for proof. We point out, however, that these two lemmas only depend on the 
underlying Gaussian process and as a result can be used separately from the 
GP-UCB framework. 

Lemma 2. Assume $\delta_2 \in (0, 1)$, a finite sample space $|A| < \infty$, and
$\beta_t = 2 \log(|A| \pi_t / \delta_2)$ where $\sum_t \pi_t^{-1} = 1$ and $\pi_t > 0$. Then with probability at least
$1 - \delta_2$ the absolute deviation of the mean is bounded by

$$|f(x) - \mu_{t-1}(x)| \le \sqrt{\beta_t}\, \sigma_{t-1}(x) \quad \forall x \in A,\ \forall t \ge 1.$$

³ To obtain rewards bounded within some range $[a, b]$ we can assume that the additive
noise $\epsilon_t$ is truncated above some large absolute value, which guarantees bounded means.

In order to simplify this discussion we have assumed that the sample space
$A$ is finite; however, this can also be extended to compact spaces [31, Lemma
5.7].
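
To make the condition on $\pi_t$ in Lemma 2 concrete, the following sketch evaluates $\beta_t$ for one admissible sequence, $\pi_t = \pi^2 t^2 / 6$, which satisfies $\sum_t \pi_t^{-1} = 1$ because $\sum_t t^{-2} = \pi^2 / 6$. This particular sequence, the function names, and the values of $|A|$ and $\delta_2$ are illustrative assumptions, not choices fixed by the text above.

```python
import math

def pi_t(t):
    """One sequence with sum_t 1/pi_t = 1, since sum_t 1/t^2 = pi^2 / 6."""
    return (math.pi ** 2) * (t ** 2) / 6.0

def beta_t(t, num_points, delta2):
    """beta_t = 2 log(|A| pi_t / delta_2) for a finite sample space A."""
    return 2.0 * math.log(num_points * pi_t(t) / delta2)

# Hypothetical setting: |A| = 1000 candidate points, delta_2 = 0.05.
betas = [beta_t(t, num_points=1000, delta2=0.05) for t in range(1, 51)]

# Partial sum of 1/pi_t, which approaches 1 from below as more terms are added.
partial = sum(1.0 / pi_t(t) for t in range(1, 100001))
```

With this choice $\beta_t$ grows only logarithmically in $t$ and is nondecreasing, which is the property the proof of Lemma 4 relies on.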

Lemma 3. The information gain for points selected by the algorithm can be
written as

$$I(y_{1:T}; f_{1:T}) = \frac{1}{2} \sum_{t=1}^{T} \log\left(1 + \sigma^{-2} \sigma_{t-1}^2(x_t)\right).$$
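
The sum in Lemma 3 can be checked numerically against the standard log-determinant form of the information gain for a GP with Gaussian noise, $\frac{1}{2} \log \det(I + \sigma^{-2} K_T)$. The sketch below does this under assumed settings: a squared exponential kernel with $k(x, x) = 1$, a hypothetical noise variance, and randomly drawn points standing in for the points selected by the algorithm.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=0.5):
    """Squared exponential kernel with k(x, x) = 1."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def posterior_variance(X_prev, x, noise_var):
    """Posterior variance at x given noisy observations at X_prev."""
    prior = sq_exp_kernel(x[None, :], x[None, :])[0, 0]
    if len(X_prev) == 0:
        return prior
    K = sq_exp_kernel(X_prev, X_prev) + noise_var * np.eye(len(X_prev))
    k_vec = sq_exp_kernel(X_prev, x[None, :])[:, 0]
    return prior - k_vec @ np.linalg.solve(K, k_vec)

rng = np.random.default_rng(0)
noise_var = 0.1                            # sigma^2, a hypothetical value
X = rng.uniform(0.0, 1.0, size=(15, 2))    # stand-ins for the selected points

# Sum form from Lemma 3: each term uses the variance before x_t is observed.
variances = [posterior_variance(X[:t], X[t], noise_var) for t in range(len(X))]
info_sum = 0.5 * sum(np.log1p(v / noise_var) for v in variances)

# Log-determinant form of the same quantity.
K_full = sq_exp_kernel(X, X)
_, logdet = np.linalg.slogdet(np.eye(len(X)) + K_full / noise_var)
info_logdet = 0.5 * logdet
```

The two values agree up to floating-point error, and every posterior variance stays between 0 and the prior variance of 1.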

The following lemma follows the proof of [31, Lemma 5.4]; however, it can be
applied outside the GP-UCB framework. Due to the slightly different conditions
we recreate this proof here.

Lemma 4. Given points $x_t$ selected by the algorithm the following bound holds
for the sum of variances:

$$\sum_{t=1}^{T} \beta_t \sigma_{t-1}^2(x_t) \le C_1 \beta_T \gamma_T,$$

where $C_1 = 2 / \log(1 + \sigma^{-2})$.

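Proof. Because $\beta_t$ is nondecreasing we can write the following inequality

$$\beta_t \sigma_{t-1}^2(x_t) \le \beta_T \sigma^2 \left( \sigma^{-2} \sigma_{t-1}^2(x_t) \right) \le \beta_T \sigma^2 \frac{\sigma^{-2}}{\log(1 + \sigma^{-2})} \log\left(1 + \sigma^{-2} \sigma_{t-1}^2(x_t)\right).$$

The second inequality holds because the posterior variance is bounded by the
prior variance, $\sigma_{t-1}^2(x) \le k(x, x) \le 1$, which allows us to write

$$\sigma^{-2} \sigma_{t-1}^2(x_t) \le \frac{\sigma^{-2}}{\log(1 + \sigma^{-2})} \log\left(1 + \sigma^{-2} \sigma_{t-1}^2(x_t)\right).$$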

By summing over both sides of the original bound and applying the result of
Lemma 3 we can see that

$$\sum_{t=1}^{T} \beta_t \sigma_{t-1}^2(x_t) \le \beta_T \tfrac{1}{2} C_1 \sum_{t=1}^{T} \log\left(1 + \sigma^{-2} \sigma_{t-1}^2(x_t)\right) = \beta_T C_1 I(y_{1:T}; f_{1:T}).$$

The result follows by bounding the information gain by $I(y_{1:T}; f_{1:T}) \le \gamma_T$, which
can be done for many common kernels, including the squared exponential [31,
Theorem 5]. □
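
The step that uses the bound on the posterior variance can be spelled out as follows. This is a sketch of one standard argument, writing $s = \sigma^{-2} \sigma_{t-1}^2(x_t)$ as shorthand (the symbol $s$ is introduced here for illustration only); since the posterior variance is at most 1, we have $0 \le s \le \sigma^{-2}$.

```latex
% \log(1+s) is concave and vanishes at s = 0, so \log(1+s)/s is
% nonincreasing for s > 0. Evaluating at the right endpoint s = \sigma^{-2}:
\frac{\log(1+s)}{s} \;\ge\; \frac{\log(1+\sigma^{-2})}{\sigma^{-2}}
\quad\Longrightarrow\quad
s \;\le\; \frac{\sigma^{-2}}{\log(1+\sigma^{-2})}\,\log(1+s),
\qquad 0 < s \le \sigma^{-2}.
% Multiplying by \beta_T \sigma^2 and using C_1 = 2/\log(1+\sigma^{-2}):
\beta_T\,\sigma_{t-1}^2(x_t) \;\le\; \beta_T\,\tfrac{1}{2}\,C_1\,
\log\!\left(1+\sigma^{-2}\sigma_{t-1}^2(x_t)\right).
```

The case $s = 0$ holds trivially, since both sides are then zero.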

Finally, the next lemma follows directly from [31, Lemma 5.2]. Note
that this lemma depends only on the definition of the GP-UCB acquisition
function, and as a result does not require that points at any previous iteration 
were actually selected via GP-UCB. 

Lemma 5. If the bound from Lemma 2 holds, then for a point $x_t^{\mathrm{UCB}}$ proposed
by GP-UCB with parameters $\beta_t$ the following bound holds:

$$f(x^*) - \mu_{t-1}(x_t^{\mathrm{UCB}}) \le \sqrt{\beta_t}\, \sigma_{t-1}(x_t^{\mathrm{UCB}}).$$
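
A small numerical check may help show why Lemma 5 needs only the definition of the GP-UCB acquisition function. The sketch below builds an arbitrary posterior mean and standard deviation on a finite set, draws $f$ inside the confidence band so that the bound of Lemma 2 holds by construction, and then compares both sides of Lemma 5. All values are synthetic and hypothetical; no GP is fitted here.

```python
import random

random.seed(0)
n = 200            # size of a hypothetical finite sample space A
sqrt_beta = 2.0    # hypothetical value of sqrt(beta_t)

mu = [random.uniform(-1.0, 1.0) for _ in range(n)]
sigma = [random.uniform(0.0, 1.0) for _ in range(n)]

# Draw f inside the confidence band, so |f - mu| <= sqrt_beta * sigma everywhere.
f = [m + random.uniform(-1.0, 1.0) * sqrt_beta * s for m, s in zip(mu, sigma)]

# GP-UCB proposes the maximizer of mu + sqrt_beta * sigma.
ucb = [m + sqrt_beta * s for m, s in zip(mu, sigma)]
i_ucb = max(range(n), key=lambda i: ucb[i])
i_star = max(range(n), key=lambda i: f[i])   # true maximizer x*

lhs = f[i_star] - mu[i_ucb]          # f(x*) - mu_{t-1}(x_t^UCB)
rhs = sqrt_beta * sigma[i_ucb]       # sqrt(beta_t) * sigma_{t-1}(x_t^UCB)
```

The inequality `lhs <= rhs` holds because $f(x^*)$ is at most the upper confidence bound at $x^*$, which in turn is at most the upper confidence bound at the GP-UCB proposal.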

We can now combine these results to construct the proof of Theorem 1.

Proof of Theorem 1. With probability at least $1 - \delta_1$ the result of Lemma 1
holds. If we assume that GP-UCB is included as one of the acquisition functions
we can write

$$-g_T^{\mathrm{Hedge}} \le O(\sqrt{T}) - g_T^{\mathrm{UCB}}$$

and by adding $\sum_{t=1}^{T} f(x^*)$ to both sides this inequality can be rewritten as

$$\sum_{t=1}^{T} \left[ f(x^*) - \mu_{t-1}(x_t) \right] \le O(\sqrt{T}) + \sum_{t=1}^{T} \left[ f(x^*) - \mu_{t-1}(x_t^{\mathrm{UCB}}) \right].$$

With probability at least $1 - \delta_2$ the bound from Lemma 2 can be applied to
the left-hand side and the result of Lemma 5 can be applied to the right. This
allows us to rewrite this inequality as

$$\sum_{t=1}^{T} \left[ f(x^*) - f(x_t) - \sqrt{\beta_t}\, \sigma_{t-1}(x_t) \right] \le O(\sqrt{T}) + \sum_{t=1}^{T} \sqrt{\beta_t}\, \sigma_{t-1}(x_t^{\mathrm{UCB}}),$$

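which means that the regret is bounded by

$$R_T = \sum_{t=1}^{T} \left[ f(x^*) - f(x_t) \right] \le O(\sqrt{T}) + \sum_{t=1}^{T} \sqrt{\beta_t}\, \sigma_{t-1}(x_t^{\mathrm{UCB}}) + \sum_{t=1}^{T} \sqrt{\beta_t}\, \sigma_{t-1}(x_t)$$

$$\le O(\sqrt{T}) + \sum_{t=1}^{T} \sqrt{\beta_t}\, \sigma_{t-1}(x_t^{\mathrm{UCB}}) + \sqrt{T C_1 \beta_T \gamma_T}.$$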

This final inequality follows directly from Lemma 4 by application of the
Cauchy-Schwarz inequality. We should note that we cannot use Lemma 4 to further
simplify the terms involving the sum over $\sigma_{t-1}(x_t^{\mathrm{UCB}})$. This is because the lemma
only holds for points that are sampled by the algorithm, which may not include
those proposed by GP-UCB.
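
For completeness, the Cauchy-Schwarz step can be written out as below. This is a sketch of the intermediate inequality, applied only to the points $x_t$ actually selected by the algorithm.

```latex
% Cauchy-Schwarz with the all-ones vector of length T, followed by Lemma 4:
\sum_{t=1}^{T} \sqrt{\beta_t}\,\sigma_{t-1}(x_t)
\;\le\; \sqrt{T \sum_{t=1}^{T} \beta_t\,\sigma_{t-1}^2(x_t)}
\;\le\; \sqrt{T\,C_1\,\beta_T\,\gamma_T}.
```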

Finally, this result depends upon Lemmas 1 and 5 holding, where Lemma 5 holds
whenever the bound from Lemma 2 does. By a simple
union bound argument we can see that these both hold with probability at least
$1 - \delta_1 - \delta_2$, and by setting $\delta_1 = \delta_2 = \delta / 2$ we recover our result. □

B Synthetic test functions



As there is no generally-agreed-upon set of test functions for Bayesian opti- 
mization in higher dimensions, we seek to sample synthetic functions from a 
known GP prior, similar to the strategy of Lizotte [22]. A GP prior is
infinite-dimensional, so on a practical level for performing experiments we simulate this
by sampling points and using the posterior mean as the synthetic objective test 
function. 

For each trial, we use an ARD kernel with $\theta$ drawn uniformly from $[0, 2]^d$.
We then sample $100d$ $d$-dimensional points, compute $K$ and then draw
$y \sim \mathcal{N}(0, K)$. The posterior mean of the resulting predictive posterior distribution
$\mu(x)$ (Section 2.1) is used as the test function. However, it is possible that for
particular values of $\theta$ and $K$, large parts of the space will be so far from the
samples that they will form plateaus along the prior mean. To reduce this, we
evaluate the test function at 500 random locations. If more than 25 of these are
0, we recompute $K$ using $200d$ points. This process is repeated, adding $100d$
points each time until the test function passes the plateau test (this is rarely
necessary in practice).
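
The procedure above can be sketched roughly as follows. Several details are assumptions made for illustration and are not specified in the text: $\theta$ is treated as a vector of per-dimension length scales, points are drawn from the unit hypercube, a small jitter is added to $K$ for numerical stability, "equal to 0" is tested with a tolerance, and a cap on the number of rounds is imposed. The function names are hypothetical.

```python
import numpy as np

def ard_kernel(X1, X2, theta):
    """ARD squared exponential kernel; theta = per-dimension length scales (assumed)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / theta
    return np.exp(-0.5 * (diff ** 2).sum(axis=-1))

def sample_test_function(d, rng, zero_tol=1e-6, max_rounds=10):
    """Return mu(x), the posterior mean of a GP fitted to a draw from the prior."""
    theta = rng.uniform(0.0, 2.0, size=d)
    X = np.empty((0, d))
    for _ in range(max_rounds):                  # max_rounds is a safety cap (assumed)
        # Add 100d new points, giving 100d, 200d, 300d, ... points in total.
        X = np.vstack([X, rng.uniform(0.0, 1.0, size=(100 * d, d))])
        n = len(X)
        K = ard_kernel(X, X, theta) + 1e-6 * np.eye(n)   # jitter is an assumption
        y = rng.multivariate_normal(np.zeros(n), K)
        alpha = np.linalg.solve(K, y)

        def mu(X_query, X=X, alpha=alpha):
            # Posterior mean k(x, X) K^{-1} y at each query row.
            return ard_kernel(np.atleast_2d(X_query), X, theta) @ alpha

        # Plateau test: at most 25 of 500 random evaluations may sit at the prior mean 0.
        probes = rng.uniform(0.0, 1.0, size=(500, d))
        if np.sum(np.abs(mu(probes)) < zero_tol) <= 25:
            break
    return mu

mu = sample_test_function(d=2, rng=np.random.default_rng(0))
values = mu(np.random.default_rng(1).uniform(0.0, 1.0, size=(5, 2)))
```

The returned `mu` is a deterministic function of its input, so it can be handed to an optimizer as a black-box objective while the sampled points stay hidden.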

Using the response surface $\mu(x)$ as the objective function, we can then
approximate the maximum using conventional global optimization techniques to
get $f(x^*)$, which permits us to use the gap metric for performance.

Note that these sample points are only used to construct the objective, and 
are not known to the optimization methods. 

Figure 5: (Best viewed in colour.) Results of experiments on the repeller control
problem. The left-most plot displays 10 sample trajectories over 100 time-steps for a 
particular repeller configuration (not necessarily optimal). The right-most plot shows 
the progress of each of the described Bayesian optimization methods on a similar model, 
averaged over 25 runs. 